AITopics | attention entropy

Collaborating Authors

attention entropy

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Relieving the Over-Aggregating Effect in Graph Transformers

Neural Information Processing SystemsJun-23-2026, 05:26:45 GMT

Graph attention has demonstrated superior performance in graph learning tasks. However, learning from global interactions can be challenging due to the large number of nodes. In this paper, we discover a new phenomenon termed overaggregating. Over-aggregating arises when a large volume of messages is aggregated into a single node with less discrimination, leading to the dilution of the key messages and potential information loss. To address this, we propose Wideformer, a plug-and-play method for graph attention. Wideformer divides the aggregation of all nodes into parallel processes and guides the model to focus on specific subsets of these processes. The division can limit the input volume per aggregation, avoiding message dilution and reducing information loss.

machine learning, natural language, node, (16 more...)

Neural Information Processing Systems

Country: North America > United States > California (0.28)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Communications (0.93)

Add feedback

Training-free Diffusion Model Adaptation for V ariable-Sized Text-to-Image Synthesis (Supplementary Materials)

Neural Information Processing SystemsFeb-17-2026, 13:57:06 GMT

We now investigate the relation between the attention entropy and the token number. The revised code are shown in Algorithm 1. Both of them are top-ranked parameter files for downloading. Experiments are conducted on a server with Intel(R) Xeon(R) Gold 6226R CPUs @ 2.90GHz and We conduct an text-based pairwise preference test. The screenshot is depicted in Figure 1.

artificial intelligence, machine learning, memory usage and time cost, (12 more...)

Neural Information Processing Systems

Country: Asia > China > Shanghai > Shanghai (0.04)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.72)

Add feedback

Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis

Neural Information Processing SystemsFeb-17-2026, 13:57:03 GMT

Diffusion models (DMs) have recently gained attention with state-of-the-art performance in text-to-image synthesis.

artificial intelligence, machine learning, resolution, (15 more...)

Neural Information Processing Systems

Country:

Asia > China > Shanghai > Shanghai (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.96)

Add feedback

AILA--First Experiments with Localist Language Models

Diederich, Joachim

arXiv.org Artificial IntelligenceNov-6-2025

This paper presents the first empirical demonstration of controllable locality in transformer language models, a novel architectural framework that enables continuous control over the degree of representation localization through a tunable locality dial parameter. Unlike traditional language models that rely exclusively on distributed representations, our approach allows dynamic interpolation between highly interpretable localist encodings and efficient distributed representations without requiring model retraining. We conducted experiments on the WikiText corpus using a two-layer transformer architecture, systematically varying the locality parameter λ across the full spectrum from 1.0 (fully localist) to 0.0 (fully distributed). Our results demonstrate that localist configurations achieve dramatically lower attention entropy, with λ = 1.0 yielding 5.36 bits compared to 7.18 bits at λ = 0.0, while maintaining substantially higher pointer fidelity scores reflecting stronger alignment with rule-specified targets. Prediction experiments reveal that intermediate locality values optimize the tradeoff between interpretability and performance, with λ = 0.6 achieving test perplexity of 4.65 and accuracy of 84.7%. These findings establish that localist language models provide a practical framework for applications in regulated domains requiring both transparency and capability, offering precise mathematical control over the interpretability-performance spectrum through explicit penalty thresholds and information-theoretic design principles.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2511.03559

Country: Oceania > Australia (0.28)

Genre: Research Report > New Finding (0.68)

Industry:

Law (1.00)
Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.67)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer Adaptation

Zixian, Wang

arXiv.org Artificial IntelligenceNov-4-2025

Pre-trained Transformers often exhibit over-confidence in source patterns and difficulty in forming new target-domain patterns during fine-tuning. We formalize the mechanism of output saturation leading to gradient suppression through standard cross-entropy and softmax analysis, showing that gradient suppression at inflection layers confines adaptation to high-level recombination of existing features while preventing low-level reconstruction. We introduce a set of layer-wise diagnostic metrics -- attention entropy (saturation proxy), activation gradient norm, parameter gradient norm, and Delta-CKA under a shared PCA basis -- to identify inflection layers characterized by both low attention entropy and steep gradient decay. Building on these findings, we propose a diagnose-first, inject-light fine-tuning strategy: selectively inserting LoRA adapters at inflection layers to restore suppressed backward signals with minimal parameter overhead. Experiments on BERT-base transfer from SST-2 to Rotten Tomatoes under under-trained and over-trained source regimes reveal that over-trained initialization benefits from inflection-layer LoRA injection, while under-trained initialization suffers performance degradation. When base features are strong, unblocking inflection layers facilitates high-level compositional adaptation; when base features are weak, full-pathway unblocking is required for low-level reconstruction, as supported by joint analysis of layer-wise activation gradients and Delta-CKA dynamics.

artificial intelligence, inflection layer, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2511.00797

Genre: Research Report (0.65)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Relieving the Over-Aggregating Effect in Graph Transformers

Sun, Junshu, Chang, Wanxing, Yang, Chenxue, Huang, Qingming, Wang, Shuhui

arXiv.org Artificial IntelligenceOct-27-2025

Graph attention has demonstrated superior performance in graph learning tasks. However, learning from global interactions can be challenging due to the large number of nodes. In this paper, we discover a new phenomenon termed over-aggregating. Over-aggregating arises when a large volume of messages is aggregated into a single node with less discrimination, leading to the dilution of the key messages and potential information loss. To address this, we propose Wideformer, a plug-and-play method for graph attention. Wideformer divides the aggregation of all nodes into parallel processes and guides the model to focus on specific subsets of these processes. The division can limit the input volume per aggregation, avoiding message dilution and reducing information loss. The guiding step sorts and weights the aggregation outputs, prioritizing the informative messages. Evaluations show that Wideformer can effectively mitigate over-aggregating. As a result, the backbone methods can focus on the informative messages, achieving superior performance compared to baseline methods.

machine learning, natural language, node, (16 more...)

arXiv.org Artificial Intelligence

2510.21267

Country: North America > United States > California (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation

Kattamuri, Ashish, Fartale, Harshwardhan, Vats, Arpita, Raja, Rahul, Prasad, Ishita

arXiv.org Artificial IntelligenceOct-13-2025

Data contamination poses a significant challenge to reliable LLM evaluation, where models may achieve high performance by memorizing training data rather than demonstrating genuine reasoning capabilities. We introduce RADAR (Recall vs. Reasoning Detection through Activation Representation), a novel framework that leverages mechanistic interpretability to detect contamination by distinguishing recall-based from reasoning-based model responses. RADAR extracts 37 features spanning surface-level confidence trajectories and deep mechanistic properties including attention specialization, circuit dynamics, and activation flow patterns. Using an ensemble of classifiers trained on these features, RADAR achieves 93\% accuracy on a diverse evaluation set, with perfect performance on clear cases and 76.7\% accuracy on challenging ambiguous examples. This work demonstrates the potential of mechanistic interpretability for advancing LLM evaluation beyond traditional surface-level metrics.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2510.08931

Country: Europe (0.28)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)

Add feedback

Attention to Order: Transformers Discover Phase Transitions via Learnability

Özönder, Şener

arXiv.org Artificial IntelligenceOct-10-2025

Phase transitions mark qualitative reorganizations of collective behavior, yet identifying their boundaries remains challenging whenever analytic solutions are absent and conventional simulations fail. Here we introduce learnability as a universal criterion, defined as the ability of a transformer model containing attention mechanism to extract structure from microscopic states. Using self-supervised learning and Monte Carlo generated configurations of the two-dimensional Ising model, we show that ordered phases correspond to enhanced learnability, manifested in both reduced training loss and structured attention patterns, while disordered phases remain resistant to learning. Two unsupervised diagnostics, the sharp jump in training loss and the rise in attention entropy, recover the critical temperature in excellent agreement with the exact value. Our results establish learnability as a data-driven marker of phase transitions and highlight deep parallels between long-range order in condensed matter and the emergence of structure in modern language models.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2510.07401

Country: Asia > Middle East > Republic of Türkiye (0.14)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Rule Encoding and Compliance in Large Language Models: An Information-Theoretic Analysis

Diederich, Joachim

arXiv.org Artificial IntelligenceOct-10-2025

The design of safety-critical agents based on large language models (LLMs) requires more than simple prompt engineering. This paper presents a comprehensive information-theoretic analysis of how rule encodings in system prompts influence attention mechanisms and compliance behaviour. We demonstrate that rule formats with low syntactic entropy and highly concentrated anchors reduce attention entropy and improve pointer fidelity, but reveal a fundamental trade-off between anchor redundancy and attention entropy that previous work failed to recognize. Through formal analysis of multiple attention architectures including causal, bidirectional, local sparse, kernelized, and cross-attention mechanisms, we establish bounds on pointer fidelity and show how anchor placement strategies must account for competing fidelity and entropy objectives. Combining these insights with a dynamic rule verification architecture, we provide a formal proof that hot reloading of verified rule sets increases the asymptotic probability of compliant outputs. These findings underscore the necessity of principled anchor design and dual enforcement mechanisms to protect LLM-based agents against prompt injection attacks while maintaining compliance in evolving domains.

artificial intelligence, large language model, natural language, (13 more...)

arXiv.org Artificial Intelligence

2510.05106

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Filters

Collaborating Authors

attention entropy

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Relieving the Over-Aggregating Effect in Graph Transformers

e0378e0c642b1d292fcb224e8d5a39b3-Supplemental-Conference.pdf

Training-free Diffusion Model Adaptation for V ariable-Sized Text-to-Image Synthesis (Supplementary Materials)

Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis

AILA--First Experiments with Localist Language Models

Attention Saturation and Gradient Suppression at Inflection Layers: Diagnosing and Mitigating Bottlenecks in Transformer Adaptation

Relieving the Over-Aggregating Effect in Graph Transformers

RADAR: Mechanistic Pathways for Detecting Data Contamination in LLM Evaluation

Attention to Order: Transformers Discover Phase Transitions via Learnability

Rule Encoding and Compliance in Large Language Models: An Information-Theoretic Analysis